on behalf of EUCanImage working group
Abstract:Composed of multiple interconnected pixels controlled by on/off RF switches, the pixel antenna can generate reconfigurable radiation patterns that can be further exploited to construct diverse pilot sequences for effective channel estimation. However, such pilot sequences inherently have rank deficiency, making it difficult to effectively and efficiently acquire the full channel state information (CSI) across all available radiation patterns. To tackle this difficulty, we consider a sparse environment with a limited number of propagation paths for a pixel antenna system, where a user equipped with a pixel antenna transmits only a limited number of pilots to recover the CSI under all radiation patterns. The proposed algorithm exploits the limited number of propagation paths that are invariant with the pixel antenna patterns, and then formulates the full channel estimation as a sparse recovery problem in the angular domain solved by Generalized Approximate Message Passing (GAMP). Moreover, to mitigate the rank deficiency of pilot sequences, we additionally incorporate a Multipath Matching Pursuit (MMP) algorithm for robust initialization. The overall proposed scheme, termed MMP-GAMP, achieves higher estimation accuracy than other algorithm baselines, while requiring lower pilot overhead.
Abstract:Distilling reasoning traces from strong large language models into smaller ones is a promising route to improve intelligence in resource-constrained settings. Existing approaches face a fundamental trade-off: offline distillation from teacher-generated traces provides high-quality, sample-efficient supervision but suffers from distributional drift: during training, the student model conditions on teacher-generated prefixes, whereas during inference the student autoregresses on self-generated prefixes, leading to compounding errors over long reasoning trajectories. Meanwhile, on-policy or self-distillation methods better match the student's inference-time distribution, but require costly online sampling and often produce low-quality traces in early training. We propose a principled offline reasoning distillation framework that preserves the efficiency and supervision quality of offline teacher-generated data while correcting teacher-student distribution drift. It adaptively emphasizes teacher supervision that is better aligned with the student's on-policy distribution. Evaluations on mathematical reasoning benchmarks of GSM8K, MATH, MATH500, and harder held-out competition-style tasks, including AMC, AIME, and OlympiadBench, show that our method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities. Our work shows that lightweight, distribution-correction-aware training can substantially strengthen offline reasoning distillation without online rollouts.
Abstract:Recent advances in Vision-Language Models (VLMs) have benefited from Reinforcement Learning (RL) for enhanced reasoning. However, existing methods still face critical limitations, including the lack of low-level visual information and effective visual feedback. To address these problems, this paper proposes a unified multimodal interleaved reasoning framework \textbf{ForeSight}, which enables VLMs to \textbf{See Further} with low-level visual cues and \textbf{Think Deeper} with effective visual feedback. First, it introduces a set of low-level visual tools to integrate essential visual information into the reasoning chain, mitigating the neglect of fine-grained visual features. Second, a mask-based visual feedback mechanism is elaborated to incorporate visual reflection into the thinking process, enabling the model to dynamically re-examine and update its answers. Driven by RL, ForeSight learns to autonomously decide on tool invocation and answer verification, with the final answer accuracy as the reward signal. To evaluate the performance of the proposed framework, we construct a new dataset, Character and Grounding SalBench (CG-SalBench), based on the SalBench dataset. Experimental results demonstrate that the ForeSight-7B model significantly outperforms other models with the same parameter scale, and even surpasses the current SOTA closed-source models on certain metrics.
Abstract:While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
Abstract:Latent diffusion models have enabled high-quality video synthesis, yet their inference remains costly and time-consuming. As diffusion transformers become increasingly efficient, the latency bottleneck inevitably shifts to VAE decoders. To reduce their latency while maintaining quality, we propose a universal acceleration framework for VAE decoders that preserves full alignment with the original latent distribution. Specifically, we propose (1) an independence-aware channel pruning method to effectively mitigate severe channel redundancy, and (2) a stage-wise dominant operator optimization strategy to address the high inference cost of the widely used causal 3D convolutions in VAE decoders. Based on these innovations, we construct a Flash-VAED family. Moreover, we design a three-phase dynamic distillation framework that efficiently transfers the capabilities of the original VAE decoder to Flash-VAED. Extensive experiments on Wan and LTX-Video VAE decoders demonstrate that our method outperforms baselines in both quality and speed, achieving approximately a 6$\times$ speedup while maintaining the reconstruction performance up to 96.9%. Notably, Flash-VAED accelerates the end-to-end generation pipeline by up to 36% with negligible quality drops on VBench-2.0.
Abstract:In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model desinged for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.
Abstract:Colorectal liver metastases (CRLM) are a major cause of cancer-related mortality, and reliable detection on CT remains challenging in multi-centre settings. We developed a foundation model-based AI pipeline for patient-level classification and lesion-level detection of CRLM on contrast-enhanced CT, integrating uncertainty quantification and explainability. CT data from the EuCanImage consortium (n=2437) and an external TCIA cohort (n=197) were used. Among several pretrained models, UMedPT achieved the best performance and was fine-tuned with an MLP head for classification and an FCOS-based head for lesion detection. The classification model achieved an AUC of 0.90 and a sensitivity of 0.82 on the combined test set, with a sensitivity of 0.85 on the external cohort. Excluding the most uncertain 20 percent of cases improved AUC to 0.91 and balanced accuracy to 0.86. Decision curve analysis showed clinical benefit for threshold probabilities between 0.30 and 0.40. The detection model identified 69.1 percent of lesions overall, increasing from 30 percent to 98 percent across lesion size quartiles. Grad-CAM highlighted lesion-corresponding regions in high-confidence cases. These results demonstrate that foundation model-based pipelines can support robust and interpretable CRLM detection and classification across heterogeneous CT data.
Abstract:The orthogonal time frequency space (OTFS) signal is considered a promising solution for high-mobility wireless environments. It manages Doppler effects by utilizing delay-Doppler (DD) domain processing. However, the relatively long OTFS frame duration could introduce considerable sensing or communication latency when radar and communication are performed separately. By operating in a dual-functional radar and communication (DFRC) mode, the OTFS system performs sensing and data transmission simultaneously, thereby reducing the resulting latency. Nevertheless, the optimal OTFS DFRC signal strategy remains insufficiently explored. This paper investigates the optimal signal design for OTFS DFRC systems, focusing on pilot symbol design and data symbol power allocation. Specifically, we derive a channel capacity lower bound metric for communication that considers channel estimation errors in OTFS. For sensing, we derive an integrated sidelobe level (ISL), accounting for the randomness of the data symbols alongside the deterministic pilot symbols. Leveraging the above metrics, we formulate an optimization problem that balances radar and communication performance, and then solve it using an alternating optimization framework. We validate the proposed signal through numerical analysis and Monte Carlo simulations. Our analysis shows that OTFS DFRC enforces a deterministic pilot signal that is characterized by a concentrated peak in the DD domain, which furnishes a common structure in the DD domain facilitating sensing and channel estimation, with data multiplexed in other DD grids, thereby unifying sensing and communication within a single OTFS signal. Compared with conventional OTFS signals, the proposed OTFS DFRC signal expands the achievable sensing-communication performance region, delivering at least a 9.45 dB ISL suppression for sensing and a 4.82 dB SINR ratio gain for communication.
Abstract:Low-Earth-orbit (LEO) satellite communication systems face challenges due to high satellite mobility, which hinders the reliable acquisition of instantaneous channel state information at the transmitter (CSIT) and subsequently degrades multi-user transmission performance. This paper investigates a downlink multi-user multi-antenna system, and tackles the above challenges by introducing orthogonal time frequency space (OTFS) modulation and rate-splitting multiple access (RSMA) transmission. Specifically, OTFS enables stable characterization of time-varying channels by representing them in the delay-Doppler domain. However, realistic propagation introduces various inter-symbol and inter-user interference due to non-orthogonal yet practical rectangular pulse shaping, fractional delays, Doppler shifts, and imperfect (statistical) CSIT. In this context, RSMA offers promising robustness for interference mitigation and CSIT imperfections, and hence is integrated with OTFS to provide a comprehensive solution. A compact cross-domain input-output relationship for RSMA-OTFS is established, and an ergodic sum-rate maximization problem is formulated and solved using a weighted minimum mean-square-error based alternating optimization algorithm that does not depend on channel sparsity. Simulation results reveal that the considered practical propagation effects significantly degrade performance if unaddressed. Furthermore, the RSMA-OTFS scheme demonstrates improved ergodic sum-rate and robustness against CSIT uncertainty across various user deployments and CSIT qualities.
Abstract:This paper proposes a novel pilot scheme for multi-user uplink channel estimation in extra-large-scale massive MIMO (XL-MIMO) systems with extremely large aperture arrays (ELAA). The large aperture of ELAA introduces spatial non-stationarity, where far-apart users have significantly distinct visibility at the antennas, thereby reducing inter-user interference. This insight motivates our novel pilot scheme to group users with distinct visibility regions to share the same frequency subcarriers for channel estimation, so that more users can be served with reduced pilot overhead. Specifically, the proposed pilot scheme employs frequency-division multiplexing for inter-group channel estimation, while intra-group users -- benefiting from strong spatial orthogonality -- are distinguished by shifted cyclic codes, similar to code-division multiplexing. Additionally, we introduce a sub-array structured ELAA, where each sub-array is a traditional MIMO array and treated as spatial stationary, while the distances between sub-arrays can be significantly larger to achieve an expanded aperture. The channel support for sub-arrays features clustered sparsity in the antenna-delay domain and is modeled by a 2-dimensional (2-D) Markov random field (MRF). Based on this, we propose a low-complexity channel estimation algorithm within a turbo Bayesian inference framework that incorporates the 2-D MRF prior model. Simulations show that the proposed scheme and algorithm allow the XL-MIMO system to support more users, and deliver superior channel estimation performance.